feat(benchmark): Create mock LLM server for use in benchmarks #1403
base: develop
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

Impacted files:

@@           Coverage Diff            @@
##           develop    #1403      +/-   ##
===========================================
+ Coverage    71.66%   71.88%   +0.22%
===========================================
  Files          171      174       +3
  Lines        17020    17154     +134
===========================================
+ Hits         12198    12332     +134
  Misses        4822     4822
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Should we be updating the copyright date on new files?
Glad you pointed this out. We should update our LICENSE.md. I'll open a PR.
def get_latency_seconds(config: ModelSettings, seed: Optional[int] = None) -> float:
    """Sample latency for this request using the model's config
    Very inefficient to generate each sample singly rather than in batch
Is this a comment about the sampling method here? All you're doing is generating random numbers. Or are you saying that because batch=1 and batch=n are very different in real inference, we are not sampling realistically?
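For context, a vectorized version could draw all of the samples in a single call. This is only an illustrative sketch, not code from this PR; the function name and signature are assumptions:

```python
import numpy as np
from typing import Optional


def sample_latencies_batch(
    mean: float, std: float, min_s: float, max_s: float, n: int, seed: Optional[int] = None
) -> np.ndarray:
    """Draw n latency samples at once from a normal distribution and clip them to [min_s, max_s]."""
    rng = np.random.default_rng(seed)
    return np.clip(rng.normal(loc=mean, scale=std, size=n), min_s, max_s)
```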
#
# class TestGetResponse:
#     """Test the get_response function."""
Remove commented code?
return response
@app.post("/v1/completions", response_model=CompletionResponse) |
Are you using completions in your benchmarking? If not, I think it is better not to support this legacy interface (https://platform.openai.com/docs/api-reference/completions/create).
Looks good, thank you Tim.
Please have a look at my comment about the completions interface.
I also made some changes to fix run_server.py code coverage in the review PR.
Summary
This PR adds a Mock LLM and an example Guardrails Content-Safety configuration that uses it end-to-end with Guardrails. I have a follow-on PR that uses Locust to run performance benchmarks on Guardrails on a laptop, without any NVCF function calls, local GPUs, or modifications to the Guardrails code.
Description
This PR includes an OpenAI-compatible Mock LLM FastAPI app. It is intended to mock production LLMs for performance-testing purposes. The configuration comes from a .env file, such as the one below for the Content Safety mock.
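A sketch of what such a .env file could look like; the variable names are the ones described below, but the specific values are illustrative assumptions rather than the file shipped with this PR:

```
# Illustrative Content Safety mock configuration – values are assumptions, not the PR's actual .env
SAFE_TEXT='{"User Safety": "safe"}'
UNSAFE_TEXT='{"User Safety": "unsafe"}'
UNSAFE_PROBABILITY=0.1
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.1
LATENCY_MIN_SECONDS=0.2
LATENCY_MAX_SECONDS=2.0
```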
The Mock LLM first decides randomly whether it should return a safe response, using the `UNSAFE_PROBABILITY` probability. This determines whether `SAFE_TEXT` or `UNSAFE_TEXT` is returned when the model responds. The Mock LLM then samples a latency for the response from a normal distribution (parameterized by `LATENCY_MEAN_SECONDS` and `LATENCY_STD_SECONDS`), and clips it against `LATENCY_MIN_SECONDS` and `LATENCY_MAX_SECONDS` respectively. After waiting, it responds with the text.
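As a rough sketch of that behavior (a standalone illustration; the function names and the use of the standard-library `random` module are assumptions, not necessarily how the PR implements it):

```python
import random
from typing import Optional


def sample_latency_seconds(
    mean: float, std: float, min_s: float, max_s: float, seed: Optional[int] = None
) -> float:
    """Draw one latency from a normal distribution and clip it to [min_s, max_s]."""
    rng = random.Random(seed)
    return min(max_s, max(min_s, rng.gauss(mean, std)))


def choose_response_text(unsafe_probability: float, safe_text: str, unsafe_text: str) -> str:
    """Return unsafe_text with probability unsafe_probability, otherwise safe_text."""
    return unsafe_text if random.random() < unsafe_probability else safe_text
```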
Test Plan
This test plan shows how the Mock LLM can be integrated seamlessly with Guardrails. As long as we characterize our Nemoguard and Application LLM latencies correctly and can represent them with a distribution, we can use this for performance testing (a hypothetical command sketch follows the terminal list below).
Terminal 1 (Content Safety Mock)
Terminal 2 (Content Safety Mock)
Terminal 3 (Guardrails production code)
Terminal 4 (Client issuing request)
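The exact commands are not reproduced here. A hypothetical sequence, assuming the mock server is started from its run_server.py entry point and Guardrails from the standard `nemoguardrails server` CLI (the ports, paths, and flags below are assumptions):

```sh
# Terminals 1 and 2 – start the mock LLM servers (entry point, flags, and ports are assumed)
python run_server.py --env-file content_safety.env --port 8001

# Terminal 3 – start the Guardrails server against the example config (path is assumed)
nemoguardrails server --config ./config

# Terminal 4 – issue a request through Guardrails
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"config_id": "content_safety", "messages": [{"role": "user", "content": "Hello!"}]}'
```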
Related Issue(s)
Checklist